Speed without drag

Saul Diez-Guerra

What's the story here?

How to speed up your numerical Python when there is no time to waste.

Think first...

TANSTAAFL!

there ain't no such thing as a free speed-up

...and then don't try.

What if...

  • You thought it through and refactored
  • You threw money at the problem
  • You evaluated the cost of opportunity
  • You blamed it on the GIL and on van Rossum, and switched to Java

Ok, fine.

But first

  • Rule out other problems
  • IO, deserialization, or any other bottleneck. I can't help you otherwise.
  • Calculating something over and over (e.g. re-compiling a regex).
  • Are you discarding any computation?
  • Different random number generators have different speeds… how much entropy do you actually need? (see the sketch after this list)
  • Many other things...
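For instance, a toy sketch of the regex and randomness bullets above (illustrative names, not from the talk):

import re
import random

WORD = re.compile(r'\w+')      # compile once and reuse; re.findall(pattern, s) pays a cache lookup on every call

def count_words(lines):
    return sum(len(WORD.findall(line)) for line in lines)

fast_random = random.random()                      # Mersenne Twister: fast, not cryptographic
strong_random = random.SystemRandom().random()     # os.urandom-backed: slower, unpredictable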

But first

  • Rule out other problems
  • Capture low-hanging fruit
  • Multiprocessing (overhead and Amdahl's)
  • Check your use of immutable objects and temporary storage: e.g. string concatenation
  • Obscure language features like __slots__
  • List comprehensions instead of repeated appends (see the sketch after this list)
  • Generators
  • But seriously, don't use his code
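A few of the items above in one quick sketch (illustrative code, not from the talk):

class Point(object):
    __slots__ = ('x', 'y')                 # no per-instance __dict__: less memory, faster attribute access
    def __init__(self, x, y):
        self.x, self.y = x, y

def join_strings(parts):
    return ''.join(parts)                  # one allocation instead of one per '+=' on an immutable str

def squares_list(n):
    return [i * i for i in xrange(n)]      # list comprehension instead of a loop of .append() calls

def squares_gen(n):
    return (i * i for i in xrange(n))      # generator expression: no intermediate list at all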

But this is not the point of the talk

Libraries and tools that help you circumvent The Safety of Python!

many tools, such speed

  • Exhausting CPython
  • SciPy
  • NumPy
  • Numba
  • Parakeet
  • NumExpr
  • Theano
  • PyPy
  • NumPyPy
  • Jython
  • Blaze
  • Pyston
  • Pythran

Driving example

Polygon area based on its coordinates

http://upload.wikimedia.org/wikipedia/commons/e/e2/Polygon_area_formula.jpg
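For reference, with vertices (x_i, y_i) and the first point repeated at the end, the shoelace formula is:

A = |sum(x_i * y_(i+1) - x_(i+1) * y_i)| / 2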

DISCLAIMER

*Time measurements are just for reference and are a biased, reductionistic tool that shouldn't be blindly trusted.*

*Especially when you use a single piece of code to test many methods of optimization at once.*

Vanilla CPython

In [49]:
"""
XX
 XX
  XX
   XX
"""

def diag_gen(length):
    # Build a closed staircase polygon (first point == last point) whose area is 2 * length.
    points = [(0.25, 0.25)]
    for _ in xrange(1, length + 1):
        points.append((points[-1][0] + 1, points[-1][1]))
        points.append((points[-1][0], points[-1][1] + 1))
    points.append((points[-1][0], points[-1][1] + 1))
    points.append((points[-1][0] - 1, points[-1][1]))
    points.append((points[-1][0], points[-1][1] - 1))
    for _ in xrange(1, length):
        points.append((points[-1][0] - 1, points[-1][1]))
        points.append((points[-1][0], points[-1][1] - 1))
    points.append((0.25, 0.25))

    return points
In [51]:
def test_diag_gen(points):
    res = 0
    for point in points:
        assert abs(point[0] + point[1] - res) <= 1
        res = point[0] + point[1]
    assert points[0] == points[-1]
test_diag_gen(diag_gen(200))
In [53]:
def area(points):
    # Shoelace formula: 0.5 * |sum(x[i]*y[i+1] - x[i+1]*y[i])|
    acc = 0
    for i in xrange(len(points) - 1):
        acc += points[i][0] * points[i + 1][1] - points[i + 1][0] * points[i][1]
    return abs(acc) / 2

assert area(diag_gen(32)) == 64.0
In [54]:
polygon = diag_gen(2*10**5)
%timeit area(polygon)
1 loops, best of 3: 252 ms per loop

SciPy

SciPy, besides being the name of the "stack" and the "ecosystem" that contain amazing things such as SciPy itself (hah), NumPy, IPython, pandas, PyTables, matplotlib... is also the name of a package filled with algorithms and convenience functions.

http://github.com/scipy/scipy

NumPy

N-dimensional arrays. C and Fortran under the hood for speed, just like SciPy. "Part" of SciPy, but packaged on its own because of its standalone utility.

https://github.com/numpy/numpy

In [55]:
import numpy as np
polygon_np = np.array(polygon, dtype=np.float64)
In [56]:
def area_np(arr):
    return abs((arr[:-1,0] * arr[1:,1] - arr[:-1,1] * arr[1:,0]).sum()) / 2
assert area_np(polygon_np) == 4.*10**5
In [59]:
print " Python Area:".ljust(21),
%timeit area(polygon)

print "Numpy Area:".ljust(20),
%timeit area_np(polygon_np)
 Python Area:        1 loops, best of 3: 257 ms per loop
 Numpy Area:         100 loops, best of 3: 8.84 ms per loop

Numba

  • JIT compiler that leverages the LLVM project
  • AST -> LLVM IR for a targeted subset of Python and NumPy
  • Experimental CUDA. Static compilation through pycc
  • NumbaPro: multicore and GPU

https://github.com/numba/numba

In [60]:
import numba as nb

area_nb = nb.jit(area)
area_np_nb = nb.jit(area_np)
In [62]:
print " Python Area:".ljust(21),
%timeit area(polygon)
print "Python Area Numba:".ljust(20),
%timeit area_nb(polygon)

print "Python NumPy:".ljust(20),
%timeit area_np(polygon_np)
print "Python NumPy Numba:".ljust(20),
%timeit area_np_nb(polygon_np)
 Python Area:        1 loops, best of 3: 257 ms per loop
 Python Area Numba:  1 loops, best of 3: 404 ms per loop
 Python NumPy:       100 loops, best of 3: 8.68 ms per loop
 Python NumPy Numba: 100 loops, best of 3: 8.56 ms per loop
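Jitting the list-of-tuples version buys nothing here because Numba likely falls back to object mode; the kind of code Numba does compile well is an explicit loop over the NumPy array. A sketch, reusing polygon_np from above (not timed against the exact Numba version used in these slides):

@nb.jit(nopython=True)
def area_loop_nb(arr):
    # same shoelace accumulation as area(), but over a typed float64 array
    acc = 0.0
    for i in range(arr.shape[0] - 1):
        acc += arr[i, 0] * arr[i + 1, 1] - arr[i + 1, 0] * arr[i, 1]
    return abs(acc) / 2

assert area_loop_nb(polygon_np) == 4.*10**5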

(Unrelated to the shoelace example: np.vectorize is just a convenience wrapper around a Python-level loop, while nb.vectorize compiles a real ufunc, which is why it keeps pace with the pure NumPy expression below.)

In [63]:
@nb.vectorize([nb.i8(nb.i8), nb.i4(nb.i4)])
def power(x):
    return x * x + 1
power_np = np.vectorize(lambda x: x * x + 1)


arr = np.arange(1000000)
%timeit -n 5 power_np(arr)
%timeit -n100 np.square(arr) + 1
%timeit -n100 power(arr)
5 loops, best of 3: 213 ms per loop
100 loops, best of 3: 5.58 ms per loop
100 loops, best of 3: 5.06 ms per loop

Parakeet

Originally similar to Numba; development now focuses on manual optimization of functions, and code generation has moved from LLVM (deprecated) to C and OpenMP (gcc).

Experimental support for CUDA.

https://github.com/iskandr/parakeet

In [65]:
import parakeet
area_pk = parakeet.jit(area)
area_np_pk = parakeet.jit(area_np)
assert area_pk(polygon) == 4.*10**5
assert area_np_pk(polygon_np) == 4.*10**5
In [66]:
%timeit area_pk(polygon)
%timeit area_np_pk(polygon_np)
1 loops, best of 3: 491 ms per loop
1000 loops, best of 3: 1.28 ms per loop

NumExpr

  • Feeds the processor large arrays (larger than its cache) in a piecemeal fashion
  • Avoids temporary storage of intermediate values (sketched below)
  • Support for OpenMP and Intel's VML/MKL for transcendental functions.
  • Reduced yet useful set of expressions, but highly performant (5-10x vs. NumPy)

https://github.com/pydata/numexpr
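What "avoids temporary storage" means in practice, as a toy sketch (a and b are illustrative arrays, not from the talk): NumPy allocates full-size temporaries for 2*a and 3*b before the final add, while NumExpr compiles the whole expression and streams cache-sized blocks through it.

import numpy as np
import numexpr as ne

a = np.random.rand(10**6)
b = np.random.rand(10**6)

c_np = 2*a + 3*b                  # two temporary arrays, roughly three passes over memory
c_ne = ne.evaluate('2*a + 3*b')   # one fused pass, block by block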

In [16]:
import numexpr as ne
def area_ne(arr):
    arrX, arrx, arrY, arry = arr[:-1,0], arr[1:,0], arr[:-1,1], arr[1:,1]
    # sum(x[i] * (y[i+1] - y[i])) over a closed polygon already equals the area, hence no division by 2
    return abs(ne.evaluate('sum(arrX * arry - arrY * arrX)'))
assert area_ne(polygon_np) == 4.*10**5
In [73]:
%timeit area_np(polygon_np)
%timeit area_ne(polygon_np)
100 loops, best of 3: 8.87 ms per loop
100 loops, best of 3: 2.54 ms per loop

Theano

Defines, optimizes and evaluates array expressions. Seamless GPU. CUDA. Native optimizations. Really specialized.

Some tricky dependencies, depending on the platform. Anaconda solves most of it.

https://github.com/Theano/Theano

In [18]:
import os
os.environ['DYLD_FALLBACK_LIBRARY_PATH'] = '/Users/saul/anaconda/lib:' + os.environ.get('DYLD_FALLBACK_LIBRARY_PATH', '')
In [19]:
from theano import tensor as T
from theano import function

def area_th(arr):
    x, y, z, j = T.vectors('x', 'y', 'z', 'j')
    expr = T.sum((x * y - z * j) * 0.5)
    
    return abs(function([x, y, z, j], expr)(arr[:-1,0], arr[1:,1], arr[:-1,1], arr[1:,0]))

assert area_th(polygon_np) == 400000.
In [44]:
%timeit -n 10 area_np(polygon_np)
%timeit -n 10 area_ne(polygon_np)
%timeit -n 10 area_th(polygon_np)
10 loops, best of 3: 10.6 ms per loop
10 loops, best of 3: 2.53 ms per loop
10 loops, best of 3: 41.5 ms per loop
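Most of that 41.5 ms is likely graph construction and compilation, which area_th redoes on every call. Compiling once and reusing the compiled function is the intended usage; a sketch, reusing T, function and polygon_np from above (assuming the same float64 setup):

x, y, z, j = T.vectors('x', 'y', 'z', 'j')
shoelace = function([x, y, z, j], T.sum((x * y - z * j) * 0.5))   # graph built and compiled once

def area_th_compiled(arr):
    # per call: only run the already-compiled function
    return abs(shoelace(arr[:-1, 0], arr[1:, 1], arr[:-1, 1], arr[1:, 0]))

assert area_th_compiled(polygon_np) == 400000.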

PyPy

A JIT-compiling implementation of the Python language spec, independent of CPython. Pretty mature, lots of support; not entirely there yet in some regards, but highly compatible.

http://pypy.org/

pypy » ./bin/pypy shoelace.py

ta: 0.029669 s

cpython » python shoelace.py

pp: 2.539226 s
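shoelace.py itself isn't reproduced in the slides; a plausible stand-in (hypothetical, reusing diag_gen and area from the CPython section) that prints this kind of timing is:

# shoelace.py -- pure-Python shoelace benchmark, runnable under CPython, PyPy and Jython
from timeit import default_timer as clock

def diag_gen(length):
    points = [(0.25, 0.25)]
    for _ in xrange(1, length + 1):
        points.append((points[-1][0] + 1, points[-1][1]))
        points.append((points[-1][0], points[-1][1] + 1))
    points.append((points[-1][0], points[-1][1] + 1))
    points.append((points[-1][0] - 1, points[-1][1]))
    points.append((points[-1][0], points[-1][1] - 1))
    for _ in xrange(1, length):
        points.append((points[-1][0] - 1, points[-1][1]))
        points.append((points[-1][0], points[-1][1] - 1))
    points.append((0.25, 0.25))
    return points

def area(points):
    acc = 0
    for i in xrange(len(points) - 1):
        acc += points[i][0] * points[i + 1][1] - points[i + 1][0] * points[i][1]
    return abs(acc) / 2

polygon = diag_gen(2 * 10 ** 5)
start = clock()
area(polygon)
print "ta:\t", round(clock() - start, 6), "s"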

NumPyPy

A fork/reimplementation of the original NumPy so that it works under PyPy. http://buildbot.pypy.org/numpy-status/latest.html

pypy » ./bin/pypy shoelace.py

pp: 0.030495 s

pypy » ./bin/pypy shoelace_np.py

np: 0.213656 s

cpython » python shoelace.py

pp: 2.539226 s

cpython » python shoelace_np.py

np: 0.077828 s

In [74]:
from timeit import Timer

polygon_3 = diag_gen(2*10**5)
polygon_3_np = np.array(diag_gen(2*10**5))
ta = Timer('area(polygon_3)', 'from __main__ import area, polygon_3')
tanp = Timer('area_np(polygon_3_np)', 'from __main__ import area_np, polygon_3_np')
print "ta:\t", round(min(ta.repeat(3, 10)), 6), "s"
print "tanp:\t", round(min(tanp.repeat(3, 10)), 6), "s"
ta:	2.590854 s
tanp:	0.085523 s

Jython

  • Python on the JVM
  • Needs JDK

http://www.jython.org/

jython » ./bin/jython shoelace.py

ta: 2.229 s

No numpy :(

In [1]:
""""

python: 2.162128 s

jython: 1.967000 s

npypy : 0.213656 s

numpy : 0.077828 s

pypy  : 0.030495 s

""";

Pyston

  • New JIT compiler of Python 2.7 by Dropbox
  • No tracing
  • Conservative garbage collection
  • Built on top of LLVM
  • AST, LLVM IR, LLVM optimizer/JIT.
    • AST to IR branching with type prediction + fall back.
  • Really early stage, lots of parts missing. OS X doesn't work yet.

https://github.com/dropbox/pyston

Blaze

Cloud, big data infrastructure and analytics library

  • Computing large, distributed, heterogeneous datasets
  • Adapters and helpers to deal with Big Data and The Cloud.
  • LLVM

Pythran

  • A roughly two-year-old project
  • Transpiles a subset of Python to modern C++, driven by comment annotations that include type information
  • Multi-core, SIMD instructions
  • AST parsing, optimization

https://github.com/serge-sans-paille/pythran

Mehdi Amini, PyData SV 2013 - http://www.youtube.com/watch?v=KKoVeiQOmik

In [22]:
#pythran export dprod(int list, int list)
def dprod(l0,l1):
    return sum(x*y for x,y in zip(l0,l1))
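Assuming the snippet above is saved as dprod.py, compiling it is a one-liner; Pythran emits a native dprod.so that imports from CPython like any ordinary module:

cpython » pythran dprod.py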

And when all else fails...

C extensions, Cython, ctypes, CFFI.
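As a taste of the lightest-weight of those, a minimal ctypes sketch (illustrative only; assumes a Unix-y system where libm can be found):

import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library('m'))   # load the C math library
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print libm.cos(0.0)   # 1.0, computed in C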

Recap

  • TANSTAAFL
  • Think first
  • Use the best tool for the job, while using the second best tool for any job
